In this module, we will learn how we can quickly analyse gene expression data to summarise the pathway enrichment analysis, using already published tools. This is a faster way for those databases that are well documented in Org.Db package. Please note that customized analysis is required for the transcriptomes that are not present in this database.

The source of the dataset is from the following paper, we investigate the impact of root dwelling microbes on plants cultivated in lower photosynthetic active radiation: A microbiota–root–shoot circuit favours Arabidopsis growth over defence under suboptimal light

Install Dependencies

pkgs <- c("BiocManager", "tidyverse", "ggridges")
# install.packages(pkgs)
lapply(pkgs, require, character.only = TRUE)
## Loading required package: BiocManager
## Bioconductor version '3.16' is out-of-date; the current release version '3.17'
##   is available with R version '4.3'; see https://bioconductor.org/install
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: ggridges
# Bioconductor packages
org <- "org.At.tair.db" # For arabidopsis
pkgs_bc <- c("clusterProfiler", "enrichplot", "pathview", org, "DOSE")
# lapply(pkgs_bc, install, character.only = TRUE)
lapply(pkgs_bc, require, character.only = TRUE)
## Loading required package: clusterProfiler
## 
## clusterProfiler v4.6.2  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
## 
## If you use clusterProfiler in published research, please cite:
## T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141
## 
## Attaching package: 'clusterProfiler'
## 
## The following object is masked from 'package:purrr':
## 
##     simplify
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## Loading required package: enrichplot
## Warning: package 'enrichplot' was built under R version 4.2.3
## Loading required package: pathview
## 
## ##############################################################################
## Pathview is an open source software package distributed under GNU General
## Public License version 3 (GPLv3). Details of GPLv3 is available at
## http://www.gnu.org/licenses/gpl-3.0.html. Particullary, users are required to
## formally cite the original Pathview paper (not just mention it) in publications
## or products. For details, do citation("pathview") within R.
## 
## The pathview downloads and uses KEGG data. Non-academic uses may require a KEGG
## license agreement (details at http://www.kegg.jp/kegg/legal.html).
## ##############################################################################
## Loading required package: org.At.tair.db
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## 
## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union
## 
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which.max, which.min
## 
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Loading required package: IRanges
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## 
## The following object is masked from 'package:clusterProfiler':
## 
##     rename
## 
## The following objects are masked from 'package:lubridate':
## 
##     second, second<-
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, rename
## 
## The following object is masked from 'package:tidyr':
## 
##     expand
## 
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## 
## 
## Attaching package: 'IRanges'
## 
## The following object is masked from 'package:clusterProfiler':
## 
##     slice
## 
## The following object is masked from 'package:lubridate':
## 
##     %within%
## 
## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, slice
## 
## The following object is masked from 'package:purrr':
## 
##     reduce
## 
## 
## Attaching package: 'AnnotationDbi'
## 
## The following object is masked from 'package:clusterProfiler':
## 
##     select
## 
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## 
## Loading required package: DOSE
## DOSE v3.24.2  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
## 
## If you use DOSE in published research, please cite:
## Guangchuang Yu, Li-Gen Wang, Guang-Rong Yan, Qing-Yu He. DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis. Bioinformatics 2015, 31(4):608-609

Read the expression profile of the genes

df <- read.delim2("./data/Hou.et.al.txt", header = TRUE, row.names = 1, sep = "\t")
case_in <- "SC_Root"
id_deg <- str_detect(colnames(df), case_in)
lfc_mat <- df[,id_deg]
colnames(lfc_mat) <- str_replace_all(colnames(lfc_mat), paste0("\\.", case_in), "")

# degs <- lfc_mat %>% dplyr::filter(abs(as.numeric(`log2FC`)) >= 1 & as.numeric(`FDR`) <= 0.05) # Filter the ones that are significant
gene_list <- as.numeric(lfc_mat$log2FC)
names(gene_list) <- row.names(lfc_mat)
gene_list <- na.omit(gene_list) %>% sort(., decreasing = TRUE)

Gene Set Enrichment

The package clusterProfiler provides onle line command to perform GSEA analysis.

# One line for GSE
gse <- clusterProfiler::gseGO(geneList = gene_list, 
             ont ="ALL", 
             keyType = "TAIR", 
             nPerm = 10000, 
             minGSSize = 3, 
             maxGSSize = 800, 
             pvalueCutoff = 0.05, 
             verbose = TRUE, 
             OrgDb = org, 
             pAdjustMethod = "none")
## preparing geneSet collections...
## GSEA analysis...
## Warning in .GSEA(geneList = geneList, exponent = exponent, minGSSize =
## minGSSize, : We do not recommend using nPerm parameter incurrent and future
## releases
## Warning in fgsea(pathways = geneSets, stats = geneList, nperm = nPerm, minSize
## = minGSSize, : You are trying to run fgseaSimple. It is recommended to use
## fgseaMultilevel. To run fgseaMultilevel, you need to remove the nperm argument
## in the fgsea function call.
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (1.52% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## leading edge analysis...
## done...

Representating GSE analysis

Lets see if our gene ontology terms corresponds to some biological function.

dotplot(gse, 
        showCategory = 10, split = ".sign") + 
  facet_grid(.~.sign)

Representating Enrichment Map

Lets explore the top 5 categories in GO term.

emapplot(pairwise_termsim(gse), showCategory = 10)

Representating Enrichment Map

Now lets look at the regulatory network of the gene set.

cnetplot(gse, 
         categorySize = "pvalue", 
         foldChange = gene_list, 
         showCategory = 2)
## Warning in cnetplot.enrichResult(x, ...): Use 'color.params = list(foldChange = your_value)' instead of 'foldChange'.
##  The foldChange parameter will be removed in the next version.
## Scale for size is already present.
## Adding another scale for size, which will replace the existing scale.
## Warning: ggrepel: 430 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Representating Enrichment Map

Lets see the distibution of the p-values for the corresponding pathways.

require(ggridges)
ridgeplot(gse) + 
  labs(x = "Gene enrichment")
## Picking joint bandwidth of 0.186

Gene Set Enrichment

This is the most interesting plot where you can see the expression pattrerns of the genes on the ranked list. You can specifically trace the running enrichment score and understand the consistency of gene expression of the corresponding pathway.

gseaplot(gse, by = "all", title = gse$Description[1], geneSetID = 1)

gs1 <- unlist(str_split(gse$core_enrichment[1], "/"))
save(list = "gs1", file = "./gene_set_for_wounding.RData")

KEGG pathway analysis

Now, lets dive into the the pathway analysis using the KEGG database. Here, we have to first change the id that is compatible with the KEGG database. We will be using bitr function to fetch the corresponding ENTREZID from the TAIR id.

# Obtain the gene list for KEGG
keg_gene_list <- as.numeric(lfc_mat$log2FC)
names(keg_gene_list) <- row.names(lfc_mat)

# Convert gene IDs for gseKEGG function
# We will lose some genes here because not all IDs will be converted
ids <- bitr(names(keg_gene_list), 
            fromType = "TAIR", 
            toType = "ENTREZID", 
            OrgDb = org.At.tair.db)
## 'select()' returned 1:1 mapping between keys and columns
## Warning in bitr(names(keg_gene_list), fromType = "TAIR", toType = "ENTREZID", :
## 3% of input gene IDs are fail to map...
# Remove duplicate IDS (here I use "TAIR", but it should be whatever was selected as keyType)
dedup_ids = ids[!duplicated(ids[c("TAIR")]), ]

# Create a new dataframe df_keg which has only the genes which were successfully mapped using the bitr function above
df_keg <- lfc_mat[row.names(lfc_mat) %in% dedup_ids$TAIR, ]

# Create a new column in df_keg with the corresponding ENTREZ IDs
df_keg$Y <- dedup_ids$ENTREZID

# Create a vector of the gene unuiverse
kegg_gene_list <- as.numeric(df_keg$log2FC)

# Name vector with ENTREZ ids
names(kegg_gene_list) <- df_keg$Y

# omit any NA values 
kegg_gene_list <- na.omit(kegg_gene_list)

# sort the list in decreasing order (required for clusterProfiler) for ranking
kegg_gene_list <- sort(kegg_gene_list, decreasing = TRUE)

# Run the KEGG enrichment
kegg_org <- "ath" # Note the short form of the organism
kkg <- gseKEGG(geneList     = kegg_gene_list,
               organism     = kegg_org,
               nPerm        = 10000,
               minGSSize    = 3,
               maxGSSize    = 800,
               pvalueCutoff = 0.05,
               pAdjustMethod = "none",
               keyType       = "ncbi-geneid") # Keytype is 
## Reading KEGG annotation online: "https://rest.kegg.jp/link/ath/pathway"...
## Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/ath"...
## Reading KEGG annotation online: "https://rest.kegg.jp/conv/ncbi-geneid/ath"...
## preparing geneSet collections...
## GSEA analysis...
## Warning in .GSEA(geneList = geneList, exponent = exponent, minGSSize =
## minGSSize, : We do not recommend using nPerm parameter incurrent and future
## releases
## Warning in fgsea(pathways = geneSets, stats = geneList, nperm = nPerm, minSize
## = minGSSize, : You are trying to run fgseaSimple. It is recommended to use
## fgseaMultilevel. To run fgseaMultilevel, you need to remove the nperm argument
## in the fgsea function call.
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (1.43% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## leading edge analysis...
## done...

KEGG pathway enrichment

Here we see the transcriptional reprogramming in pathways.

dotplot(kkg, showCategory = 10, title = "Enriched Pathways" , split=".sign") + 
  facet_grid(.~.sign)

KEGG network plot for identification of the gene expression dependencies

emapplot(pairwise_termsim(kkg))

KEGG network plot for the pathway specifics

cnetplot(kkg)
## Warning: ggrepel: 213 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Distribution of differentailly expressed genes

ridgeplot(kkg) + labs(x = "Enrichment")
## Picking joint bandwidth of 0.195

GSEA Plot for KEGG terms

The enrichment analysis shows the genes corresponding to the flavonoid biosynthesis are enriched. Please note that the 1 refers to the index of flavonoid biosynthesis. You may explore other pathways by changing this.

gseaplot(kkg, by = "all", title = kkg$Description[1], geneSetID = 1)

Pathway snapshot using pathview

Finally we summarise the KEGG pathway correspnding to flavanoid biosynthesis and investigate the gene expression patterns of the pathway components.

require(pathview)

# Produce the native KEGG plot (PNG)
at_pv <- pathview(gene.data=kegg_gene_list, 
                pathway.id="ath00941", 
                species = kegg_org, 
                mid = "white", 
                high = "darkgreen", 
                low = "darkmagenta", kegg.dir = "./kegg_out/")
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /Users/akumarbasak/Documents/PA_practical_shared
## Info: Writing image file ath00941.pathview.png
knitr::include_graphics("./ath00941.pathview.png")

END

  1. You can download your favourite dataset (LogFC table). Please note that the transcriptome should be within the Org.db.
  1. Find the top 5 pathways that are enriched upon experimental perturbation by using both KEGG and GO term analyses.
  1. Check the interdependencies of the genes that are expressed within the top 5 pathways.
sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] DOSE_3.24.2           org.At.tair.db_3.16.0 AnnotationDbi_1.60.2 
##  [4] IRanges_2.32.0        S4Vectors_0.36.2      Biobase_2.58.0       
##  [7] BiocGenerics_0.44.0   pathview_1.38.0       enrichplot_1.18.4    
## [10] clusterProfiler_4.6.2 ggridges_0.5.4        lubridate_1.9.2      
## [13] forcats_1.0.0         stringr_1.5.0         dplyr_1.1.2          
## [16] purrr_1.0.1           readr_2.1.4           tidyr_1.3.0          
## [19] tibble_3.2.1          ggplot2_3.4.2         tidyverse_2.0.0      
## [22] BiocManager_1.30.21  
## 
## loaded via a namespace (and not attached):
##   [1] ggnewscale_0.4.9       fgsea_1.24.0           colorspace_2.1-0      
##   [4] ggtree_3.6.2           gson_0.1.0             qvalue_2.30.0         
##   [7] XVector_0.38.0         aplot_0.1.10           rstudioapi_0.15.0     
##  [10] farver_2.1.1           graphlayouts_1.0.0     ggrepel_0.9.3         
##  [13] bit64_4.0.5            scatterpie_0.2.1       fansi_1.0.4           
##  [16] codetools_0.2-18       splines_4.2.2          cachem_1.0.8          
##  [19] GOSemSim_2.24.0        knitr_1.43             polyclip_1.10-4       
##  [22] jsonlite_1.8.7         GO.db_3.16.0           png_0.1-8             
##  [25] graph_1.76.0           ggforce_0.4.1          compiler_4.2.2        
##  [28] httr_1.4.6             lazyeval_0.2.2         Matrix_1.5-1          
##  [31] fastmap_1.1.1          cli_3.6.1              tweenr_2.0.2          
##  [34] htmltools_0.5.5        tools_4.2.2            igraph_1.5.0          
##  [37] gtable_0.3.3           glue_1.6.2             GenomeInfoDbData_1.2.9
##  [40] reshape2_1.4.4         fastmatch_1.1-3        Rcpp_1.0.11           
##  [43] jquerylib_0.1.4        vctrs_0.6.3            Biostrings_2.66.0     
##  [46] nlme_3.1-160           ape_5.7-1              ggraph_2.1.0          
##  [49] xfun_0.39              timechange_0.2.0       lifecycle_1.0.3       
##  [52] XML_3.99-0.14          org.Hs.eg.db_3.16.0    zlibbioc_1.44.0       
##  [55] MASS_7.3-58.1          scales_1.2.1           tidygraph_1.2.3       
##  [58] hms_1.1.3              KEGGgraph_1.58.3       parallel_4.2.2        
##  [61] RColorBrewer_1.1-3     curl_5.0.1             yaml_2.3.7            
##  [64] memoise_2.0.1          gridExtra_2.3          downloader_0.4        
##  [67] ggfun_0.1.1            HDO.db_0.99.1          yulab.utils_0.0.6     
##  [70] sass_0.4.6             stringi_1.7.12         RSQLite_2.3.1         
##  [73] highr_0.10             tidytree_0.4.2         BiocParallel_1.32.6   
##  [76] GenomeInfoDb_1.34.9    rlang_1.1.1            pkgconfig_2.0.3       
##  [79] bitops_1.0-7           evaluate_0.21          lattice_0.20-45       
##  [82] labeling_0.4.2         treeio_1.22.0          patchwork_1.1.2       
##  [85] shadowtext_0.1.2       cowplot_1.1.1          bit_4.0.5             
##  [88] tidyselect_1.2.0       plyr_1.8.8             magrittr_2.0.3        
##  [91] R6_2.5.1               generics_0.1.3         DBI_1.1.3             
##  [94] pillar_1.9.0           withr_2.5.0            KEGGREST_1.38.0       
##  [97] RCurl_1.98-1.12        crayon_1.5.2           utf8_1.2.3            
## [100] tzdb_0.4.0             rmarkdown_2.23         viridis_0.6.3         
## [103] grid_4.2.2             data.table_1.14.8      Rgraphviz_2.42.0      
## [106] blob_1.2.4             digest_0.6.33          gridGraphics_0.5-1    
## [109] munsell_0.5.0          viridisLite_0.4.2      ggplotify_0.1.1       
## [112] bslib_0.5.0